NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

RCPA: An Open‐Source R Package for Data Processing, Differential Analysis, Consensus Pathway Analysis, and Visualization

https://doi.org/10.1002/cpz1.1036

Nguyen, Hung; Nguyen, Ha; Maghsoudi, Zeynab; Tran, Bang; Draghici, Sorin; Nguyen, Tin (May 2024, Current Protocols)

Abstract Identifying impacted pathways is important because it provides insights into the biology underlying conditions beyond the detection of differentially expressed genes. Because of the importance of such analysis, more than 100 pathway analysis methods have been developed thus far. Despite the availability of many methods, it is challenging for biomedical researchers to learn and properly perform pathway analysis. First, the sheer number of methods makes it challenging to learn and choose the correct method for a given experiment. Second, computational methods require users to be savvy with coding syntax, and comfortable with command‐line environments, areas that are unfamiliar to most life scientists. Third, as learning tools and computational methods are typically implemented only for a few species (i.e., human and some model organisms), it is difficult to perform pathway analysis on other species that are not included in many of the current pathway analysis tools. Finally, existing pathway tools do not allow researchers to combine, compare, and contrast the results of different methods and experiments for both hypothesis testing and analysis purposes. To address these challenges, we developed an open‐source R package for Consensus Pathway Analysis (RCPA) that allows researchers to conveniently: (1) download and process data from NCBI GEO; (2) perform differential analysis using established techniques developed for both microarray and sequencing data; (3) perform both gene set enrichment, as well as topology‐based pathway analysis using different methods that seek to answer different research hypotheses; (4) combine methods and datasets to find consensus results; and (5) visualize analysis results and explore significantly impacted pathways across multiple analyses. This protocol provides many example code snippets with detailed explanations and supports the analysis of more than 1000 species, two pathway databases, three differential analysis techniques, eight pathway analysis tools, six meta‐analysis methods, and two consensus analysis techniques. The package is freely available on the CRAN repository. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: Processing Affymetrix microarrays Basic Protocol 2: Processing Agilent microarrays Support Protocol: Processing RNA sequencing (RNA‐Seq) data Basic Protocol 3: Differential analysis of microarray data (Affymetrix and Agilent) Basic Protocol 4: Differential analysis of RNA‐Seq data Basic Protocol 5: Gene set enrichment analysis Basic Protocol 6: Topology‐based (TB) pathway analysis Basic Protocol 7: Data integration and visualization
more » « less
Full Text Available
Fourteen years of cellular deconvolution: methodology, applications, technical evaluation and outstanding challenges

https://doi.org/10.1093/nar/gkae267

Nguyen, Hung; Nguyen, Ha; Tran, Duc; Draghici, Sorin; Nguyen, Tin (April 2024, Nucleic Acids Research)

Abstract Single-cell RNA sequencing (scRNA-Seq) is a recent technology that allows for the measurement of the expression of all genes in each individual cell contained in a sample. Information at the single-cell level has been shown to be extremely useful in many areas. However, performing single-cell experiments is expensive. Although cellular deconvolution cannot provide the same comprehensive information as single-cell experiments, it can extract cell-type information from bulk RNA data, and therefore it allows researchers to conduct studies at cell-type resolution from existing bulk datasets. For these reasons, a great effort has been made to develop such methods for cellular deconvolution. The large number of methods available, the requirement of coding skills, inadequate documentation, and lack of performance assessment all make it extremely difficult for life scientists to choose a suitable method for their experiment. This paper aims to fill this gap by providing a comprehensive review of 53 deconvolution methods regarding their methodology, applications, performance, and outstanding challenges. More importantly, the article presents a benchmarking of all these 53 methods using 283 cell types from 30 tissues of 63 individuals. We also provide an R package named DeconBenchmark that allows readers to execute and benchmark the reviewed methods (https://github.com/tinnlab/DeconBenchmark).
more » « less
A novel approach for predicting upstream regulators (PURE) that affect gene expression

https://doi.org/10.1038/s41598-023-41374-0

Nguyen, Tuan-Minh; Craig, Douglas B.; Tran, Duc; Nguyen, Tin; Draghici, Sorin (December 2023, Scientific Reports)

Abstract External factors such as exposure to a chemical, drug, or toxicant (CDT), or conversely, the lack of certain chemicals can cause many diseases. The ability to identify such causal CDTs based on changes in the gene expression profile is extremely important in many studies. Furthermore, the ability to correctly infer CDTs that can revert the gene expression changes induced by a given disease phenotype is a crucial step in drug repurposing. We present an approach for Predicting Upstream REgulators (PURE) designed to tackle this challenge. PURE can correctly infer a CDT from the measured expression changes in a given phenotype, as well as correctly identify drugs that could revert disease-induced gene expression changes. We compared the proposed approach with four classical approaches as well as with the causal analysis used in Ingenuity Pathway Analysis (IPA) on 16 data sets (1 rat, 5 mouse, and 10 human data sets), involving 8 chemicals or drugs. We assessed the results based on the ability to correctly identify the CDT as indicated by its rank. We also considered the number of false positives, i.e. CDTs other than the correct CDT that were reported to be significant by each method. The proposed approach performed best in 11 out of the 16 experiments, reporting the correct CDT at the very top 7 times. IPA was the second best, reporting the correct CDT at the top 5 times, but was unable to identify the correct CDT at all in 5 out of the 16 experiments. The validation results showed that our approach, PURE, outperformed some of the most popular methods in the field. PURE could effectively infer the true CDTs responsible for the observed gene expression changes and could also be useful in drug repurposing applications.
more » « less
Full Text Available
Integration of Multimodal Data from Disparate Sources for Identifying Disease Subtypes

https://doi.org/10.3390/biology11030360

Zhou, Kaiyue; Kottoori, Bhagya Shree; Munj, Seeya Awadhut; Zhang, Zhewei; Draghici, Sorin; Arslanturk, Suzan (March 2022, Biology)

Studies over the past decade have generated a wealth of molecular data that can be leveraged to better understand cancer risk, progression, and outcomes. However, understanding the progression risk and differentiating long- and short-term survivors cannot be achieved by analyzing data from a single modality due to the heterogeneity of disease. Using a scientifically developed and tested deep-learning approach that leverages aggregate information collected from multiple repositories with multiple modalities (e.g., mRNA, DNA Methylation, miRNA) could lead to a more accurate and robust prediction of disease progression. Here, we propose an autoencoder based multimodal data fusion system, in which a fusion encoder flexibly integrates collective information available through multiple studies with partially coupled data. Our results on a fully controlled simulation-based study have shown that inferring the missing data through the proposed data fusion pipeline allows a predictor that is superior to other baseline predictors with missing modalities. Results have further shown that short- and long-term survivors of glioblastoma multiforme, acute myeloid leukemia, and pancreatic adenocarcinoma can be successfully differentiated with an AUC of 0.94, 0.75, and 0.96, respectively.
more » « less
Full Text Available
SMRT: Randomized Data Transformation for Cancer Subtyping and Big Data Analysis

https://doi.org/10.3389/fonc.2021.725133

Nguyen, Hung; Tran, Duc; Tran, Bang; Roy, Monikrishna; Cassell, Adam; Dascalu, Sergiu; Draghici, Sorin; Nguyen, Tin (October 2021, Frontiers in Oncology)

Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. The treatment options, as well as treatment success, are highly dependent on the correct subtyping of individual patients. With the advancement of high-throughput platforms, we have the opportunity to differentiate among cancer subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). This demands powerful integrative methods to leverage large multi-omics datasets for a better subtyping. Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for multi-omics integration and cancer subtyping. SMRT offers the following advantages over existing approaches: (i) the scalable analysis pipeline allows researchers to integrate multi-omics data and analyze hundreds of thousands of samples in minutes, (ii) the ability to integrate data types with different numbers of patients, (iii) the ability to analyze un-matched data of different types, and (iv) the ability to offer users a convenient data analysis pipeline through a web application. We also improve the efficiency of our ensemble-based, perturbation clustering to support analysis on machines with memory constraints. In an extensive analysis, we compare SMRT with eight state-of-the-art subtyping methods using 37 TCGA and two METABRIC datasets comprising a total of almost 12,000 patient samples from 28 different types of cancer. We also performed a number of simulation studies. We demonstrate that SMRT outperforms other methods in identifying subtypes with significantly different survival profiles. In addition, SMRT is extremely fast, being able to analyze hundreds of thousands of samples in minutes. The web application is available at http://SMRT.tinnguyen-lab.com . The R package will be deposited to CRAN as part of our PINSPlus software suite.
more » « less
Full Text Available
CPA: a web-based platform for consensus pathway analysis and interactive visualization

https://doi.org/10.1093/nar/gkab421

Nguyen, Hung; Tran, Duc; Galazka, Jonathan M; Costes, Sylvain V; Beheshti, Afshin; Petereit, Juli; Draghici, Sorin; Nguyen, Tin (May 2021, Nucleic Acids Research)
null (Ed.)
Abstract In molecular biology and genetics, there is a large gap between the ease of data collection and our ability to extract knowledge from these data. Contributing to this gap is the fact that living organisms are complex systems whose emerging phenotypes are the results of multiple complex interactions taking place on various pathways. This demands powerful yet user-friendly pathway analysis tools to translate the now abundant high-throughput data into a better understanding of the underlying biological phenomena. Here we introduce Consensus Pathway Analysis (CPA), a web-based platform that allows researchers to (i) perform pathway analysis using eight established methods (GSEA, GSA, FGSEA, PADOG, Impact Analysis, ORA/Webgestalt, KS-test, Wilcox-test), (ii) perform meta-analysis of multiple datasets, (iii) combine methods and datasets to accurately identify the impacted pathways underlying the studied condition and (iv) interactively explore impacted pathways, and browse relationships between pathways and genes. The platform supports three types of input: (i) a list of differentially expressed genes, (ii) genes and fold changes and (iii) an expression matrix. It also allows users to import data from NCBI GEO. The CPA platform currently supports the analysis of multiple organisms using KEGG and Gene Ontology, and it is freely available at http://cpa.tinnguyen-lab.com.
more » « less
Full Text Available

Search for: All records